Day04 - End-to-End Speech Recognition: The Attention Mechanism - iT 邦幫忙

2021 iThome Ironman Contest

DAY 4

AI & Data

Part 4 of the series "Machine Learning Applied to Speech-Related Services"


13th Ironman Contest

pwhsiao

2021-09-16 15:20:45


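If you have followed deep learning over the past few years, you have almost certainly heard of attention models and the well-known paper Attention Is All You Need. I won't go into the details of the paper here; thorough explanations and implementations are easy to find online. The focus here is on how attention addresses the problem mentioned earlier: recognition quality degrades when the input sequence is too long.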

The core idea of attention can be summed up in one sentence:

Attention: At different steps, let a model "focus" on different parts of the input.

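With attention, the decoder works out at each decoding step which parts of the input sequence matter most, so the encoder no longer has to compress the whole input sequence into a single fixed-size context vector.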

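Instead, every decoder hidden state gets its own context vector: if the decoder runs for N steps, N context vectors are computed, each one a different weighting of the encoder states over the input frames. The attention computation can be written as follows: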

All encoder hidden states: $s_{1},s_{2},...,s_{m}$
Decoder hidden state at time step t: $h_{t}$
Attention weights: apply a score function between the decoder hidden state $h_{t}$ at the current time step and every encoder hidden state, then pass the scores through a softmax to get the importance of each encoder state $s_{i}$ to $h_{t}$
$\alpha_{k}(t)=\frac{\exp(\mathrm{score}(h_{t},s_{k}))}{\sum_{i=1}^{m}\exp(\mathrm{score}(h_{t},s_{i}))},\quad k=1,\dots,m$

The score function can be computed in several ways, including dot-product, bilinear, and multi-layer perceptron.
Context vector: the weighted sum of the encoder hidden states $s_{k}$, using the attention weights $\alpha_{k}(t)$ as the weights

$c_{t}=\sum_{k=1}^{m}\alpha_{k}(t)s_{k}$
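
The two formulas above can be sketched in a few lines of NumPy. This is only an illustration, not code from the original post: the function names (`softmax`, `dot_score`, `bilinear_score`, `mlp_score`, `attend`) and the parameters `W`, `W_h`, `W_s`, `v` are assumptions made for this example, and the bilinear and MLP scores use one common formulation of each, which the text does not spell out. In a real model those parameters would be learned; here they are random.

```python
import numpy as np

def softmax(x):
    # Subtracting the max improves numerical stability without changing the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_score(h_t, S):
    # Dot-product score: score(h_t, s_k) = h_t . s_k, computed for all k at once.
    # S has shape (m, d): one encoder hidden state per row.
    return S @ h_t

def bilinear_score(h_t, S, W):
    # Assumed bilinear form: score(h_t, s_k) = h_t^T W s_k, with W of shape (d_h, d_s).
    return S @ (W.T @ h_t)

def mlp_score(h_t, S, W_h, W_s, v):
    # Assumed MLP form: score(h_t, s_k) = v^T tanh(W_h h_t + W_s s_k).
    return np.tanh(W_h @ h_t + S @ W_s.T) @ v

def attend(h_t, S, score_fn=dot_score):
    # Attention weights alpha_k(t): softmax over the m scores.
    alpha = softmax(score_fn(h_t, S))
    # Context vector c_t = sum_k alpha_k(t) * s_k.
    c_t = alpha @ S
    return alpha, c_t

# Toy example: m = 5 encoder states of dimension d = 4 (random stand-ins).
rng = np.random.default_rng(0)
m, d = 5, 4
S = rng.normal(size=(m, d))   # encoder hidden states s_1..s_m
h_t = rng.normal(size=d)      # decoder hidden state at step t

alpha, c_t = attend(h_t, S)

# Swapping in another score function only changes how the scores are produced.
W = rng.normal(size=(d, d))   # hypothetical bilinear weight matrix
alpha_bi, c_bi = attend(h_t, S, lambda h, s: bilinear_score(h, s, W))
```

Calling `attend` once per decoder step with the current `h_t` gives a fresh `alpha` and `c_t` each time, which is why each decoder hidden state ends up with its own context vector.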